Med-cDiff: Conditional Medical Image Generation with Diffusion Models

Conditional image generation plays a vital role in medical image analysis as it is effective in tasks such as super-resolution, denoising, and inpainting, among others. Diffusion models have been shown to perform at a state-of-the-art level in natural image generation, but they have not been thoroughly studied in medical image generation with specific conditions. Moreover, current medical image generation models have their own problems, limiting their usage in various medical image generation tasks. In this paper, we introduce the use of conditional Denoising Diffusion Probabilistic Models (cDDPMs) for medical image generation, which achieve state-of-the-art performance on several medical image generation tasks.


Introduction
Conditional image generation refers to the generation of images using a generative model based on relevant information, which we denote as a condition. When the condition is an image, this is also referred to as image-to-image translation. In the medical domain, this has many important applications, such as super-resolution, inpainting, and denoising, which can potentially improve healthcare [1]. Super-resolution can help shorten imaging time and improve imaging quality. Denoising helps clinicians and downstream algorithms make better diagnostic judgments. Medical image inpainting can be beneficial to anomaly detection.
Existing generative models are able to perform some of these jobs decently; e.g., the Hierarchical Probabilistic UNet (HPUNet) [2] for ultrasound image inpainting, and the progressive Generative Adversarial Network (GAN) [3] and SMORE [4] for medical image super-resolution. These methods work to some extent, but they are tailored to specific applications or imaging modalities, making it difficult for researchers to adapt them to different tasks or modalities. MedGAN [5] and UP-GAN [6] target general-purpose medical image generation; however, they are difficult to train and/or produce underwhelming results.
Models based on Variational Autoencoders (VAEs) can be effective in some medical applications [2,7], but the generated images tend to be blurry [8]. Although GAN-based models can generate high-quality medical images [5,9], they suffer from unstable training due to vanishing gradients, convergence issues, and mode collapse [10]. Normalizing Flows (NFs), which have also been used in medical imaging [11,12], can estimate the exact likelihood of a generated sample, making them suitable for certain applications; however, NFs require specially designed network architectures, and the generated image quality fails to impress. Diffusion models have been dominant in natural image generation due to their ability to generate high-fidelity, realistic images [13][14][15][16]. They have also been applied to medical image generation [17][18][19][20], such as in super-resolution medical imaging [21], but there are only a limited number of studies using conditional diffusion models.
We propose a conditional Denoising Diffusion Probabilistic Model (cDDPM), which we call the medical conditional diffusion model (Med-cDiff), and apply it to a variety of medical image generation tasks, including super-resolution, denoising, and inpainting. In a series of experiments, we show that Med-cDiff achieves state-of-the-art (SOTA) generation performance on these tasks, demonstrating the great potential of diffusion models in conditional medical image generation.

Related Work
Before diffusion models became popular in medical image analysis or in mainstream computer vision, GANs [22] were the most popular image generation methods. Developed to perform conditional natural image generation, Pix2PixGAN [23] was adapted to medical imaging, and several researchers have shown its usefulness in such tasks [24][25][26][27]. Zhu et al. [28] proposed CycleGAN to perform conditional image-to-image translation between two domains using unpaired images, and the model has also been extensively used in medical imaging. Du et al. [29] made use of CycleGAN in CT image artifact reduction. Yang et al. [30] used a structure-constrained CycleGAN to perform unpaired MRI-to-CT brain image generation. Liu et al. [31] utilized a multi-cycle GAN to synthesize CT images from MRI for head-neck radiotherapy. Harms et al. [32] applied CycleGAN to image correction for cone-beam computed tomography (CBCT). Karras et al. [33] proposed StyleGAN, which features an automatically learned, unsupervised separation of high-level attributes and stochastic variation in the generated images, enabling easier control of the image synthesis process. Fetty et al. [34] manipulated the latent space for high-resolution medical image synthesis via StyleGAN. Su et al. [35] performed data augmentation for brain CT motion artifact detection using StyleGAN. Hong et al. [9] introduced 3D StyleGAN for volumetric medical image generation. Other GAN-based methods have also been proposed for medical imaging. Progressive GAN [3] was used to perform medical image super-resolution. Upadhyay et al. [6] extended the model by utilizing uncertainty estimation to focus more on uncertain regions during image generation. Armanious et al. [5] proposed MedGAN, specific to medical image domain adaptation, which captures the high- and low-frequency components of the desired target modality.
Apart from GANs, other generative models, including VAEs and NFs, are also popular in image generation. The VAE was introduced by Kingma and Welling [36], and it has been the basis for a variety of image generation methods. Vahdat and Kautz [37] developed Nouveau VAE (NVAE), a hierarchical VAE that is able to generate highly realistic images. Hung et al. [2] adapted some of the features of NVAE into their hierarchical conditional VAE for ultrasound image inpainting. Cui et al. [38] adopted NVAE for positron emission tomography (PET) image denoising and uncertainty estimation. As for NF models, Grover et al. [39] proposed AlignFlow, which performs unpaired cross-domain translation with NF models instead of GANs. Bui et al. [40] extended AlignFlow to medical imaging for unpaired multi-contrast MRI conditional image generation. Wang et al. [41] and Beizaee et al. [42] applied NFs to medical image harmonization.
In recent years, diffusion models have become the dominant approach to image generation due to their ability to generate realistic images. On natural images, diffusion models have achieved SOTA results in unconditional image generation, outperforming their GAN counterparts [13,14]. Diffusion models have achieved outstanding performance in tasks such as super-resolution [16,43], image editing [44,45], and unpaired conditional image generation [46], and they have attained SOTA performance in conditional image generation [15]. In medical imaging, unsupervised anomaly detection is an important application of unconditional diffusion models [17,47,48,49]. Image segmentation is a popular application of conditional diffusion models, where the image to be segmented serves as the condition [19,50,51,52,53]. Diffusion models have also been widely applied to accelerating MRI reconstruction [20,54,55]. Özbey et al. [18] used GANs to shorten the denoising process of diffusion models for medical imaging.

Background
The goal of conditional image generation is to generate the target image $x_0$ given a correlated conditional image $y$. Diffusion models consist of two parts: a forward noising process $q$ and a reverse denoising process $p_\theta$ parameterized by $\theta$. Figure 1 illustrates conditional diffusion models. At a high level, given $y$, they sample from a data distribution during $p_\theta$, reversing $q$, which adds noise iteratively to the original image $x_0$. More specifically, the sampling process starts with a random noise sample $x_T$ and iteratively generates less-noisy samples, $x_{T-1}, x_{T-2}, \ldots$, based on the conditional image $y$, for $T$ steps until reaching the final output sample $x_0$. For a specific sample $x_t$ during the process, the larger $t$ is, the noisier the sample. Given the conditional image $y$, the reverse process $p_\theta$ learns to denoise the sample $x_t$ by one step to $x_{t-1}$.

The forward process $q$ is a Markovian noising process, where Gaussian noise is added to the image $x_{t-1}$ at each time step $t = 1, 2, \ldots, T$ according to a variance schedule $\beta_t$:

$$q(x_t \mid x_{t-1}) = \mathcal{N}\!\left(x_t;\, \sqrt{1 - \beta_t}\, x_{t-1},\, \beta_t \mathbf{I}\right), \tag{1}$$

where $\mathcal{N}(\cdot)$ denotes the normal distribution and $\mathbf{I}$ is the identity matrix. Note that

$$q(x_{1:T} \mid x_0) = \prod_{t=1}^{T} q(x_t \mid x_{t-1}),$$

where $T$ is the number of steps. The forward noising process (1) can be used to sample $x_t$ at any timestep $t$ in closed form. In other words, since

$$q(x_t \mid x_0) = \mathcal{N}\!\left(x_t;\, \sqrt{\bar{\alpha}_t}\, x_0,\, (1 - \bar{\alpha}_t)\mathbf{I}\right),$$

then for the original image $x_0$ and any given timestep $t$,

$$x_t = \sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,$$

where $\alpha_t = 1 - \beta_t$, $\bar{\alpha}_t = \prod_{i=1}^{t} \alpha_i$, and $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. When $T$ is large, we can assume that $x_T \sim \mathcal{N}(0, \mathbf{I})$, i.e., random Gaussian noise containing no information regarding the original image $x_0$ [13].
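To make the closed-form sampling property concrete, below is a minimal PyTorch sketch (our illustration, not the authors' code) of drawing $x_t$ directly from $x_0$; the linear $\beta_t$ schedule endpoints follow the implementation details in Section 4, while the batch and image shapes are illustrative.

```python
import torch

# Linear variance schedule: beta_1 = 1e-4 to beta_T = 0.02 over T = 2000 steps (Section 4).
T = 2000
betas = torch.linspace(1e-4, 0.02, T)        # beta_t, t = 1..T (0-indexed in code)
alphas = 1.0 - betas                         # alpha_t = 1 - beta_t
alpha_bars = torch.cumprod(alphas, dim=0)    # alpha_bar_t = prod_{i=1}^{t} alpha_i

def q_sample(x0: torch.Tensor, t: torch.Tensor, eps: torch.Tensor) -> torch.Tensor:
    """Draw x_t ~ q(x_t | x_0) in closed form:
    x_t = sqrt(alpha_bar_t) * x_0 + sqrt(1 - alpha_bar_t) * eps."""
    ab = alpha_bars[t].view(-1, 1, 1, 1)     # broadcast over (B, C, H, W)
    return ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps

# Example: noise a batch of single-channel 128x128 images at random timesteps.
x0 = torch.rand(4, 1, 128, 128) * 2.0 - 1.0  # images rescaled to [-1, 1]
t = torch.randint(0, T, (4,))
x_t = q_sample(x0, t, torch.randn_like(x0))
```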
In a conditional diffusion model, the objective is to learn the reverse process $p_\theta$ so that we can infer $x_{t-1}$ given $x_t$ and the conditional image $y$. In this way, starting from the Gaussian noise $x_T \sim \mathcal{N}(0, \mathbf{I})$ and given $y$, we can iteratively infer the sample at time step $t-1$ from the sample at time step $t$ until we reach the original image $x_0$. The reverse process can therefore be parameterized as

$$p_\theta(x_{t-1} \mid x_t, y) = \mathcal{N}\!\left(x_{t-1};\, \mu_\theta(x_t, y, t),\, \Sigma_\theta(x_t, y, t)\right),$$

where we set $\Sigma_\theta(x_t, y, t) = \sigma_t^2 \mathbf{I}$. As for $\mu_\theta(x_t, y, t)$, Ho et al. [13] showed that it must be parameterized as

$$\mu_\theta(x_t, y, t) = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, y, t)\right),$$

where $\epsilon_\theta(x_t, y, t)$ is a function approximating $\epsilon$. For a total of $T$ steps, the training objective is to minimize the variational lower bound on the negative log-likelihood:

$$\mathbb{E}\left[-\log p_\theta(x_0)\right] \le \mathbb{E}_q\left[-\log \frac{p_\theta(x_{0:T})}{q(x_{1:T} \mid x_0)}\right] = L(\theta).$$
More efficient training can be achieved by optimizing random terms in the training objective $L(\theta)$ using stochastic gradient descent. Therefore, we can rewrite the training objective as

$$L(\theta) = \mathbb{E}_q\bigg[\underbrace{D_{\mathrm{KL}}\big(q(x_T \mid x_0)\,\|\,p(x_T)\big)}_{L_T(\theta)} + \sum_{t=2}^{T} \underbrace{D_{\mathrm{KL}}\big(q(x_{t-1} \mid x_t, x_0)\,\|\,p_\theta(x_{t-1} \mid x_t, y)\big)}_{L_{t-1}(\theta)} \underbrace{-\log p_\theta(x_0 \mid x_1, y)}_{L_0(\theta)}\bigg], \tag{10}$$

where $D_{\mathrm{KL}}(\cdot \,\|\, \cdot)$ is the Kullback–Leibler (KL) divergence between two distributions. In (10), the posterior term $q(x_{t-1} \mid x_t, x_0)$ is given by

$$q(x_{t-1} \mid x_t, x_0) = \mathcal{N}\!\left(x_{t-1};\, \tilde{\mu}_t(x_t, x_0),\, \tilde{\beta}_t \mathbf{I}\right),$$

where

$$\tilde{\mu}_t(x_t, x_0) = \frac{\sqrt{\bar{\alpha}_{t-1}}\, \beta_t}{1 - \bar{\alpha}_t}\, x_0 + \frac{\sqrt{\alpha_t}\,(1 - \bar{\alpha}_{t-1})}{1 - \bar{\alpha}_t}\, x_t \quad \text{and} \quad \tilde{\beta}_t = \frac{1 - \bar{\alpha}_{t-1}}{1 - \bar{\alpha}_t}\, \beta_t.$$
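For reference, the posterior mean $\tilde{\mu}_t$ and variance $\tilde{\beta}_t$ can be computed directly from the same variance schedule; the following sketch (our own, under the linear schedule above) uses the convention $\bar{\alpha}_0 = 1$.

```python
import torch

T = 2000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)
# alpha_bar_{t-1}, with the convention alpha_bar_0 = 1.
alpha_bars_prev = torch.cat([torch.ones(1), alpha_bars[:-1]])

def q_posterior(x0: torch.Tensor, x_t: torch.Tensor, t: torch.Tensor):
    """Mean and variance of the forward-process posterior q(x_{t-1} | x_t, x_0)."""
    b = betas[t].view(-1, 1, 1, 1)
    a = alphas[t].view(-1, 1, 1, 1)
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    ab_prev = alpha_bars_prev[t].view(-1, 1, 1, 1)
    mean = (ab_prev.sqrt() * b / (1.0 - ab)) * x0 \
         + (a.sqrt() * (1.0 - ab_prev) / (1.0 - ab)) * x_t   # mu_tilde_t
    var = (1.0 - ab_prev) / (1.0 - ab) * b                    # beta_tilde_t
    return mean, var
```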

Training and Sampling
When $t = T$, $L_T(\theta)$ is a constant with no learnable parameters, since $\beta_t$ is fixed to a constant. Therefore, $L_T(\theta)$ can be ignored during training.
For $1 < t \le T$, the term $L_{t-1}(\theta)$ can be expressed as

$$L_{t-1}(\theta) = \mathbb{E}_q\left[\frac{1}{2\sigma_t^2}\left\|\tilde{\mu}_t(x_t, x_0) - \mu_\theta(x_t, y, t)\right\|^2\right] + C, \tag{16}$$

where $C$ is a constant. When $t = 0$, assuming all the image data have been re-scaled to $[-1, 1]$, the expression for $L_0(\theta)$ can be written as

$$L_0(\theta) = -\mathbb{E}_q\left[\sum_{i=1}^{H \times W} \log \int_{\delta_-(x_0^i)}^{\delta_+(x_0^i)} \mathcal{N}\!\left(x;\, \mu_\theta^i(x_1, y, 1),\, \sigma_1^2\right) dx\right], \tag{17}$$

where $H$ and $W$ are the height and width of the image, respectively, $\delta$ is a small number, and

$$\delta_+(x) = \begin{cases} \infty, & x = 1, \\ x + \delta, & x < 1, \end{cases} \qquad \delta_-(x) = \begin{cases} -\infty, & x = -1, \\ x - \delta, & x > -1. \end{cases}$$

From Equations (16) and (17), we see that the training objective is differentiable with respect to the model parameters $\theta$. During each training step, we sample the image pair $(x_0, y)$ from the dataset, $(x_0, y) \sim p_{\text{data}}(x, y)$, the time step $t$ from a uniform distribution, $t \sim \mathcal{U}(\{1, 2, \ldots, T\})$, and $\epsilon$ from a normal distribution, $\epsilon \sim \mathcal{N}(0, \mathbf{I})$. We then perform gradient descent on

$$L_{\text{simple}}(\theta) = \left\|\epsilon - \epsilon_\theta\!\left(\sqrt{\bar{\alpha}_t}\, x_0 + \sqrt{1 - \bar{\alpha}_t}\, \epsilon,\; y,\; t\right)\right\|^2,$$

which is an alternative, reweighted variational bound that has been shown to be better for sampling quality [13].
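A minimal sketch of one training step on $L_{\text{simple}}(\theta)$ is given below; the channel-wise concatenation of $y$ with $x_t$ is our assumption about how the condition enters the network, since the paper only specifies that $\epsilon_\theta$ is a U-Net (Section 4), and `model` is a placeholder for that network.

```python
import torch
import torch.nn.functional as F

T = 2000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

def training_step(model, optimizer, x0, y):
    """One gradient step on the simplified objective
    ||eps - eps_theta(sqrt(ab_t) x0 + sqrt(1 - ab_t) eps, y, t)||^2."""
    t = torch.randint(0, T, (x0.shape[0],))          # t ~ U({1, ..., T}), 0-indexed
    eps = torch.randn_like(x0)                       # eps ~ N(0, I)
    ab = alpha_bars[t].view(-1, 1, 1, 1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps   # closed-form forward sample
    # Conditioning by concatenating y with x_t along channels is an assumption.
    eps_pred = model(torch.cat([x_t, y], dim=1), t)
    loss = F.mse_loss(eps_pred, eps)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```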
During sampling, $x_T$ is first drawn from a normal distribution, $x_T \sim \mathcal{N}(0, \mathbf{I})$. Then we iteratively sample $x_{T-1}, x_{T-2}, \ldots, x_0$ from the distribution $x_{t-1} \sim p_\theta(x_{t-1} \mid x_t, y)$ by

$$x_{t-1} = \frac{1}{\sqrt{\alpha_t}}\left(x_t - \frac{\beta_t}{\sqrt{1 - \bar{\alpha}_t}}\, \epsilon_\theta(x_t, y, t)\right) + \sigma_t z,$$

where $\sigma_t$ is an untrained, time-dependent constant and $z \sim \mathcal{N}(0, \mathbf{I})$.
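The corresponding ancestral sampling loop can be sketched as follows, again assuming channel-wise concatenation for the condition and $\sigma_t^2 = \beta_t$ as in Section 4; `model` is a placeholder for the trained $\epsilon_\theta$ network, and $y$ is assumed to have the same spatial shape as $x_0$.

```python
import torch

T = 2000
betas = torch.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

@torch.no_grad()
def sample(model, y):
    """Ancestral sampling given the conditional image y of shape (B, 1, H, W)."""
    x = torch.randn_like(y)                                   # x_T ~ N(0, I)
    for t in reversed(range(T)):
        t_batch = torch.full((y.shape[0],), t, dtype=torch.long)
        eps_pred = model(torch.cat([x, y], dim=1), t_batch)
        mean = (x - betas[t] / (1.0 - alpha_bars[t]).sqrt() * eps_pred) / alphas[t].sqrt()
        sigma = betas[t].sqrt()                               # sigma_t^2 = beta_t (Section 4)
        z = torch.randn_like(x) if t > 0 else torch.zeros_like(x)  # no noise at the last step
        x = mean + sigma * z
    return x                                                  # approximate sample of x_0
```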

Experiments

Datasets
Our method is evaluated on the following datasets:

X-ray Denoising: The public chest X-ray dataset [57] contains 5863 X-ray images from pneumonia patients and normal patients; 624 images were used for testing. Pneumonia patients were further categorized as virus- or bacteria-infected. We randomly added Gaussian noise as well as salt-and-pepper noise to the images and used the original images as the ground truth.
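A possible implementation of this corruption step is sketched below; the noise strength `sigma` and the salt-and-pepper fraction `sp_frac` are illustrative values that the paper does not report.

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt(img: np.ndarray, sigma: float = 0.1, sp_frac: float = 0.02) -> np.ndarray:
    """Add Gaussian noise plus salt-and-pepper noise to an image in [-1, 1]."""
    noisy = img + rng.normal(0.0, sigma, img.shape)   # additive Gaussian noise
    u = rng.random(img.shape)
    noisy[u < sp_frac / 2] = -1.0                     # "pepper" pixels
    noisy[u > 1.0 - sp_frac / 2] = 1.0                # "salt" pixels
    return np.clip(noisy, -1.0, 1.0)

clean = rng.random((128, 128)) * 2.0 - 1.0            # stand-in X-ray image
noisy = corrupt(clean)                                # (noisy, clean) training pair
```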

MRI Inpainting: The dataset consists of 18,813 T1-weighted prostate MRI images acquired with the Spoiled Gradient Echo (SPGR) sequence, of which 6271 were used for testing. The inpainting masks were randomly generated during training and held fixed for testing.
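As an illustration of the mask protocol, the sketch below generates one random rectangular hole per image; the hole shape and size range are our assumptions, since the paper only states that masks were randomized for training and fixed for testing.

```python
import numpy as np

rng = np.random.default_rng(0)

def random_mask(h: int = 128, w: int = 128, min_size: int = 16, max_size: int = 48) -> np.ndarray:
    """Binary inpainting mask with one random rectangular hole (0 = region to inpaint)."""
    mask = np.ones((h, w), dtype=np.float32)
    mh = int(rng.integers(min_size, max_size))
    mw = int(rng.integers(min_size, max_size))
    top = int(rng.integers(0, h - mh))
    left = int(rng.integers(0, w - mw))
    mask[top:top + mh, left:left + mw] = 0.0
    return mask

img = rng.random((128, 128), dtype=np.float32)
condition = img * random_mask()   # masked image used as the condition y
```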

Implementation and Evaluation Details
For Med-cDiff, $\epsilon_\theta(x_t, y, t)$ was parameterized by a U-Net [58] with group normalization [59]. The total number of steps was set to $T = 2000$. The forward process variances were set to constants increasing linearly from $\beta_1 = 10^{-4}$ to $\beta_T = 0.02$. We also set $\sigma_t^2 = \beta_t$. All images were resized to 128 × 128, and the pixel values were normalized to the range $[-1, 1]$ in a patient-wise manner. The models were all trained for $2 \times 10^5$ iterations with a learning rate of $1 \times 10^{-4}$.
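One plausible reading of the patient-wise normalization is sketched below: all slices belonging to one patient are rescaled to $[-1, 1]$ using that patient's global minimum and maximum, preserving relative intensities across slices; this interpretation is our assumption.

```python
import numpy as np

def normalize_patient(volume: np.ndarray) -> np.ndarray:
    """Rescale all of one patient's slices to [-1, 1] using the patient-wise
    min and max, so relative intensities across slices are preserved."""
    vmin, vmax = float(volume.min()), float(volume.max())
    return 2.0 * (volume - vmin) / (vmax - vmin) - 1.0

volume = np.random.rand(24, 128, 128).astype(np.float32)  # stand-in patient volume
normalized = normalize_patient(volume)
```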
Due to the domain gaps [66,67] between different datasets and tasks, training a single network on the combined datasets would yield worse performance than training networks separately. Thus, we trained and tested our method on each task separately.

MRI Super-Resolution
For MRI super-resolution, we downsampled the MRI images by factors of 2√2, 4, and 4√2, and then upscaled the images to their original size. We compared the performance of Med-cDiff against bilinear interpolation, pix2pixGAN [23], and SRGAN [68], both visually and quantitatively, as evaluated by LPIPS, FID, and acutance, as well as by performance on the downstream zonal segmentation task.
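The degradation pipeline can be sketched as follows; bilinear resampling for the downsampling step is an assumption, as the paper does not state the resampling kernel.

```python
import torch
import torch.nn.functional as F

def degrade(hr: torch.Tensor, factor: float) -> torch.Tensor:
    """Downsample an HR image by `factor` and upscale back to the original size,
    producing the low-resolution condition for super-resolution training."""
    _, _, h, w = hr.shape
    lh, lw = max(1, round(h / factor)), max(1, round(w / factor))
    lr = F.interpolate(hr, size=(lh, lw), mode="bilinear", align_corners=False)
    return F.interpolate(lr, size=(h, w), mode="bilinear", align_corners=False)

hr = torch.rand(1, 1, 128, 128) * 2.0 - 1.0     # stand-in HR MRI slice in [-1, 1]
condition = degrade(hr, factor=2 * 2 ** 0.5)    # factors used: 2*sqrt(2), 4, 4*sqrt(2)
```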
Figure 2 shows qualitative results. Clearly, images generated by the other methods are blurry and lack realistic textures, whereas Med-cDiff is able to recover the shape of the prostate as well as relevant textures. For zonal segmentation, we utilized the pretrained CAT-nnUNet [69] and calculated the 3D patient-wise DSC for evaluation. The quantitative results are reported in Table 1, confirming that the images generated by Med-cDiff are the most realistic, with the best sharpness, and are useful for downstream zonal segmentation. Furthermore, to show the effectiveness of Med-cDiff on zonal segmentation, we further downsampled the original images by factors of 8, 8√2, and 16 and performed MRI super-resolution. The results on downstream zonal segmentation are plotted in Figure 3, which reveals that Med-cDiff clearly outperforms bilinear interpolation and pix2pixGAN. CAT-nnUNet performs similarly on images generated by Med-cDiff and SRGAN for PZ segmentation, but it performs better on images generated by Med-cDiff for TZ segmentation. The segmentation performance using bilinear interpolation and pix2pixGAN drops drastically as the upscaling factor increases, while the segmentation performance using images generated by SRGAN and Med-cDiff does not decrease much.
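For completeness, the patient-wise DSC used in the segmentation comparison reduces to the standard Dice formula, sketched here on stand-in binary masks.

```python
import numpy as np

def dice(pred: np.ndarray, gt: np.ndarray, eps: float = 1e-8) -> float:
    """Dice similarity coefficient, DSC = 2|A ∩ B| / (|A| + |B|),
    computed over a patient's full 3D binary zone mask."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    return 2.0 * np.logical_and(pred, gt).sum() / (pred.sum() + gt.sum() + eps)

pred = np.zeros((24, 128, 128), dtype=bool); pred[8:16, 40:80, 40:80] = True
gt = np.zeros_like(pred); gt[9:17, 42:82, 42:82] = True
print(f"DSC = {dice(pred, gt):.3f}")              # stand-in masks for illustration
```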

X-ray Denoising
We evaluated the denoising results using the LPIPS and FID metrics, and further compared downstream classification performance, where 3-class classification (normal/bacterial pneumonia/viral pneumonia) was performed using VGG11 [70]. We compared Med-cDiff against pix2pixGAN [23] and UP-GAN [6].
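For reproducibility, LPIPS and FID can be computed with the `lpips` and `torchmetrics` packages, as sketched below; grayscale images are repeated to three channels, and in practice the full test set (not a tiny batch, as here) is needed for a stable FID estimate. The paper does not state which implementations were used.

```python
import torch
import lpips                                                  # pip install lpips
from torchmetrics.image.fid import FrechetInceptionDistance   # pip install torchmetrics[image]

fake = torch.rand(16, 1, 128, 128) * 2.0 - 1.0    # stand-in generated images in [-1, 1]
real = torch.rand(16, 1, 128, 128) * 2.0 - 1.0    # stand-in ground-truth images

# LPIPS expects 3-channel inputs in [-1, 1]; repeat the grayscale channel.
lpips_fn = lpips.LPIPS(net="alex")
lpips_score = lpips_fn(fake.repeat(1, 3, 1, 1), real.repeat(1, 3, 1, 1)).mean()

# FID compares Inception feature statistics; torchmetrics takes uint8 images by default.
to_uint8 = lambda x: ((x + 1.0) * 127.5).clamp(0, 255).to(torch.uint8)
fid = FrechetInceptionDistance(feature=2048)
fid.update(to_uint8(real).repeat(1, 3, 1, 1), real=True)
fid.update(to_uint8(fake).repeat(1, 3, 1, 1), real=False)
print(lpips_score.item(), fid.compute().item())
```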
The quantitative results are reported in Table 2. Med-cDiff outperforms the other methods on every metric. Qualitative results are shown in Figure 4, where we see that pix2pixGAN creates new artifacts and distorts the anatomy, while UP-GAN creates unrealistic, blurry images lacking details. More specifically, in the normal image example in Figure 4, the yellow arrows point to the newly generated artifacts, and the red arrows point to the unusually large spinal cord. By contrast, Med-cDiff generates realistic patterns in those regions. In the viral pneumonia example, pix2pixGAN cannot generate the bright pattern in the original image at the yellow arrow. As for the bacterial pneumonia example, pix2pixGAN cannot generate the spinal cord with the correct shape at the yellow arrow. In both pneumonia examples, pix2pixGAN fails to recover the correct shape of the ribs at the red arrows.

MRI Inpainting
We compared our method against other inpainting methods, namely pix2pixGAN, HPUNet [2], and UP-GAN, using the LPIPS and FID metrics. Furthermore, we performed a two-alternative forced choice (2AFC) study [65] to measure how well trainees can discriminate real images from generated ones. We randomly sampled 50 real and generated image pairs from the test set for each method and asked four trainees to perform 2AFC, averaging the results across the four trainees.
The quantitative results in Table 3 reveal that Med-cDiff generates the most realistic images. The 2AFC values convey that it is difficult to tell that images generated by Med-cDiff are not real, while it is easy to discern the inauthenticity of images generated by the competing methods. The visual results in Figure 5 further confirm that Med-cDiff generates the most authentic images. More specifically, in the masked regions, pix2pixGAN generates unrealistic patterns that are clear indicators of GAN-generated images, while HPUNet generates somewhat realistic but still relatively blurry patches. HPUNet was designed for ultrasound image inpainting, and its performance is unimpressive when applied to MRI images, which illustrates the difficulty of applying some methods across imaging modalities. As for UP-GAN, the generated patches are blurry, while Med-cDiff generates realistic patterns and contents.

Conclusions
We have introduced Med-cDiff, a conditional diffusion model for medical image generation, and shown that it is effective in several medical image generation tasks, including MRI super-resolution, X-ray image denoising, and MRI image inpainting. We have demonstrated that Med-cDiff can generate high-fidelity images that are both quantitatively and qualitatively superior to those generated by other GAN- and VAE-based methods. The images generated by Med-cDiff were also tested in downstream tasks such as organ segmentation and disease classification, and we showed that these tasks can benefit from the generated images.
More importantly, Med-cDiff was not designed for any specific application, yet it outperforms models designed for specific applications. For example, SRGAN is specifically designed to generate high-resolution images from low-resolution images, as it upsamples the low-resolution images within the network, while HPUNet is mainly used for inpainting ultrasound images to generate realistic ultrasound noise patterns. By contrast, since conditional diffusion models can generate highly realistic images, Med-cDiff can learn to generate various medical images with different characteristics and patterns.
In future work, we will apply Med-cDiff to other downstream tasks, e.g., anomaly detection and faster image reconstruction. Conditional medical image generation is not limited to these tasks; other applications, such as inter-modality image translation and image enhancement, are also worthy of exploration.

Figure 1. A graphical model representation of conditional diffusion models. The blue and green arrows indicate the forward and reverse processes, respectively.

Figure 2. Qualitative comparison of Med-cDiff against other super-resolution methods.

Figure 3. DSC comparison of Med-cDiff against bilinear interpolation, pix2pixGAN, and SRGAN for zonal segmentation. The purple dotted lines indicate scores from the original high-resolution images.

Figure 4. Qualitative comparison of Med-cDiff against other denoising methods. Arrows point to regions that pix2pixGAN cannot correctly generate.

Table 1. Numerical comparison of Med-cDiff against other super-resolution methods.

Table 2. Quantitative comparison of Med-cDiff against other denoising methods.

Table 3. Quantitative comparison of Med-cDiff against other inpainting methods.